Skip to content

Conversation

@Sharl0tteIsTaken
Copy link

The comparison with pd.NA incorrectly returned False. This fix overwrites the result with NA value in the original location.

Closes #63328

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
  • If I used AI to develop this pull request, I prompted it to follow AGENTS.md.

The comparison with pd.NA incorrectly returned False.
This fix overwrites the result with NA value in the original location.

Closes pandas-dev#63328
@Sharl0tteIsTaken
Copy link
Author

Hello @rhshadrach,

Tests

I have tested the purposed changes with command pytest pandas/tests/arrays, result is 19368 passed, 20 skipped, 6 xfailed.

Explain why it is written this way

I originally wanted to use

result[pd.isna(x)] = pd.NA

(suggest by numpy: https://numpy.org/doc/stable/user/how-to-index.html#replace-values-after-filtering), but this will raise TypeError: boolean value of NA is ambiguous. Could not figure out how this happened or how to bypass it, and this also has another problem of how to replace with the original value (I reckon that replace with a default NA is just kicking the can down the road).

So that's why I ultimately chose the current method. I think a test is also required to check the new behaviour, but I wanted to first make sure that this is the right approach and won't cause more problems.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. When changing the behavior, please add tests and a note in the whatsnew of the next minor release (3.1 here).

result = libops.vec_compare(x.ravel(), y.ravel(), op)
else:
result = libops.scalar_compare(x.ravel(), y, op)
result = np.where(isna(x), x, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the correct behavior; e.g. 5 = = np.nan should give False. We only want to change the behavior for pd.NA.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to make sure, only replace pd.NA is the intended behavior? so not to replace any of None, pd.NaT, np.nan?

If so, pd.isna() and np.isnan() both aren't gonna work. It seems only the Python is operator is capable to differentiate pd.NA with others. If there's another way to do this, please let me know. Much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: inconsistency with missing values comparison

2 participants